17 research outputs found

    What factors influence the popularity of user-generated text in the creative domain? A case study of book reviews

    This study investigates a range of psychological, lexical, semantic, and readability features of book reviews to elucidate the factors underlying their perceived popularity. To this end, we conduct statistical analyses of various features, including the types and frequency of opinion and emotion-conveying terms, connectives, character mentions, word uniqueness, commonness, and sentence structure, among others. Additionally, we utilize two readability tests to explore whether reading ease is positively associated with review popularity. Finally, we employ traditional machine learning classifiers and transformer-based fine-tuned language models with n-gram features to automatically determine review popularity. Our findings indicate that, with the exception of a few features (e.g., review length, emotions, and word uniqueness), most attributes do not exhibit significant differences between popular and non-popular review groups. Furthermore, the poor performance of machine learning classifiers using the word n-gram feature highlights the challenges associated with determining popularity in creative domains. Overall, our study provides insights into the factors underlying review popularity and highlights the need for further research in this area, particularly in the creative realm. Comment: Accepted in 22nd IEEE International Conference on Machine Learning and Applications (ICMLA), 202

    Comprehending Lexical and Affective Ontologies in the Demographically Diverse Spatial Social Media Discourse

    This study aims to comprehend linguistic and socio-demographic features, encompassing English language styles, conveyed sentiments, and lexical diversity within spatial online social media review data. To this end, we undertake a case study that scrutinizes reviews composed by two distinct and demographically diverse groups. Our analysis entails the extraction and examination of various statistical, grammatical, and sentiment features from these two groups. Subsequently, we leverage these features with machine learning (ML) classifiers to discern their potential in effectively differentiating between the groups. Our investigation unveils substantial disparities in certain linguistic attributes between the two groups. When integrated into ML classifiers, these attributes exhibit a marked efficacy in distinguishing the groups, yielding a macro F1 score of approximately 0.85. Furthermore, we conduct a comparative evaluation of these linguistic features with word n-gram-based lexical features in discerning demographically diverse review data. As expected, the n-gram lexical features, coupled with fine-tuned transformer-based models, show superior performance, attaining accuracies surpassing 95% and macro F1 scores exceeding 0.96. Our meticulous analysis and comprehensive evaluations substantiate the efficacy of linguistic and sentiment features in effectively discerning demographically diverse review data. The findings of this study provide valuable guidelines for future research endeavors concerning the analysis of demographic patterns in textual content across various social media platforms. Comment: Accepted in 22nd IEEE International Conference on Machine Learning and Applications (ICMLA), 202
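
    The macro F1 score reported in this abstract is the unweighted mean of the per-class F1 scores. A minimal sketch of that computation follows; the function name and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical groups "a" and "b".
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

    Because every class contributes equally, macro F1 can differ noticeably from accuracy when the groups are imbalanced.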

    Enhancing Efficiency and Privacy in Memory-Based Malware Classification through Feature Selection

    Malware poses a significant security risk to individuals, organizations, and critical infrastructure by compromising systems and data. Leveraging memory dumps that offer snapshots of computer memory can aid the analysis and detection of malicious content, including malware. To improve the efficacy and address privacy concerns in malware classification systems, feature selection can play a critical role, as it can identify the most relevant features and thus minimize the amount of data fed to classifiers. In this study, we employ three feature selection approaches to identify significant features from memory content and use them with a diverse set of classifiers to enhance the performance and privacy of the classification task. Comprehensive experiments are conducted across three levels of malware classification tasks: i) binary-level benign or malware classification, ii) malware type classification (including Trojan horse, ransomware, and spyware), and iii) malware family classification within each family (with varying numbers of classes). Results demonstrate that the feature selection strategy, incorporating mutual information and other methods, enhances classifier performance for all tasks. Notably, selecting only 25% and 50% of input features using Mutual Information and then employing the Random Forest classifier yields the best results. Our findings reinforce the importance of feature selection for malware classification and provide valuable insights for identifying appropriate approaches. By advancing the effectiveness and privacy of malware classification systems, this research contributes to safeguarding against security threats posed by malicious software. Comment: Accepted in IEEE ICMLA-2023 Conference
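
    The idea of keeping only the 25% or 50% of features with the highest mutual information can be sketched as follows. This is a simplified illustration that assumes discrete feature values; the function names and the toy data are hypothetical, and the classifier step (Random Forest in the abstract) is omitted.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum of p(x,y) * log(p(x,y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_top_fraction(rows, labels, fraction):
    """Indices of the top `fraction` of feature columns, ranked by MI with the labels."""
    n_features = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    k = max(1, round(fraction * n_features))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: feature 0 mirrors the label, the other columns carry no information.
rows = [[0, 1, 0, 5], [0, 1, 1, 5], [1, 1, 0, 5], [1, 1, 1, 5]]
labels = [0, 0, 1, 1]
kept = select_top_fraction(rows, labels, 0.25)
```

    Only the selected columns would then be passed to the classifier, which is how feature selection reduces the amount of memory content exposed to the model.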

    SSentiA: A Self-Supervised Sentiment Analyzer for Classification From Unlabeled Data

    In recent years, supervised machine learning (ML) methods have realized remarkable performance gains for sentiment classification utilizing labeled data. However, labeled data are usually expensive to obtain and thus not always available. When annotated data are unavailable, unsupervised tools are used, which still lag behind the performance of supervised ML methods by a large margin. Therefore, in this work, we focus on improving the performance of sentiment classification from unlabeled data. We present a self-supervised hybrid methodology SSentiA (Self-supervised Sentiment Analyzer) that couples an ML classifier with a lexicon-based method for sentiment classification from unlabeled data. We first introduce LRSentiA (Lexical Rule-based Sentiment Analyzer), a lexicon-based method to predict the semantic orientation of a review along with the confidence score of prediction. Utilizing the confidence scores of LRSentiA, we generate highly accurate pseudo-labels for SSentiA, which incorporates a supervised ML algorithm to improve the performance of sentiment classification for less polarized and complex reviews. We compare the performances of LRSentiA and SSentiA with the existing unsupervised, lexicon-based and self-supervised methods in multiple datasets. LRSentiA performs similarly to the existing lexicon-based methods in both binary and 3-class sentiment analysis. By combining LRSentiA with an ML classifier, the hybrid approach SSentiA attains 10%–30% improvements in macro F1 score for both binary and 3-class sentiment analysis. The results suggest that in domains where annotated data are unavailable, SSentiA can significantly improve the performance of sentiment classification. Moreover, we demonstrate that using 30%–60% annotated training data, SSentiA delivers performance similar to that obtained with the fully labeled training dataset.
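
    The pseudo-labeling step described in this abstract, where only high-confidence lexicon predictions become training labels, can be sketched roughly as below. The toy lexicon, the confidence formula, and the 0.8 threshold are all assumptions made for illustration; the abstract does not specify the rules LRSentiA actually uses.

```python
# Hypothetical toy lexicon; the real lexicon and rules are not given in the abstract.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}


def lexicon_predict(text):
    """Return (label, confidence) from counts of positive and negative lexicon words."""
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if LEXICON.get(t, 0) > 0)
    neg = sum(1 for t in tokens if LEXICON.get(t, 0) < 0)
    total = pos + neg
    if total == 0:
        return None, 0.0  # no opinion words found
    label = "positive" if pos >= neg else "negative"
    confidence = abs(pos - neg) / total  # assumed confidence: polarity margin
    return label, confidence


def make_pseudo_labels(texts, threshold=0.8):
    """Split texts into confident (text, label) pairs and a remainder for the ML classifier."""
    confident, remainder = [], []
    for text in texts:
        label, confidence = lexicon_predict(text)
        if label is not None and confidence >= threshold:
            confident.append((text, label))
        else:
            remainder.append(text)
    return confident, remainder


reviews = ["great book love it", "good plot but poor ending", "the book arrived"]
confident, remainder = make_pseudo_labels(reviews)
```

    In a pipeline of this shape, a supervised classifier would be trained on `confident` and then applied to `remainder`, which holds the less polarized reviews that the lexicon cannot resolve.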

    Spaghetti Tracer: A Framework for Tracing Semiregular Filamentous Densities in 3D Tomograms

    Within cells, cytoskeletal filaments are often arranged into loosely aligned bundles. These fibrous bundles are dense enough to exhibit a certain regularity and mean direction; however, their packing is not sufficient to impose a symmetry between, or a specific shape on, individual filaments. This intermediate regularity is computationally difficult to handle because individual filaments have a certain directional freedom, yet the filament densities are not well segmented from each other (especially in the presence of noise, such as in cryo-electron tomography). In this paper, we develop a dynamic programming-based framework, Spaghetti Tracer, to characterize the structural arrangement of filaments in the challenging 3D maps of subcellular components. Assuming that the tomogram can be rotated such that the filaments are oriented in a mean direction, the proposed framework first identifies local seed points for candidate filament segments, which are then grown from the seeds using a dynamic programming algorithm. We validate various algorithmic variations of our framework on simulated tomograms that closely mimic the noise and appearance of experimental maps. As we know the ground truth in the simulated tomograms, the statistical analysis consisting of precision, recall, and F1 scores allows us to optimize the performance of this new approach. We find that a bipyramidal accumulation scheme for path density is superior to straight-line accumulation. In addition, the multiplication of forward and backward path densities provides for an efficient filter that lifts the filament density above the noise level. Resulting from our tests is a robust method that can be expected to perform well (F1 scores 0.86–0.95) under experimental noise conditions.
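
    To illustrate how dynamic programming can follow a filament that runs along a mean direction, the sketch below traces a single maximum-density path through a 2D grid. It is a deliberately reduced toy, not the Spaghetti Tracer algorithm: it works in 2D rather than 3D, has no seed-point detection, and does not implement the bipyramidal accumulation or the forward/backward density product mentioned in the abstract. The function name and the grid are hypothetical.

```python
def trace_best_path(density):
    """Trace one maximum-density path through a 2D grid, advancing one row per step.

    Each step may shift the column by -1, 0, or +1, which models limited
    directional freedom around the mean (row) direction.
    """
    rows, cols = len(density), len(density[0])
    score = [list(density[0])] + [[0.0] * cols for _ in range(rows - 1)]
    back = [[0] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(cols):
            # Best predecessor among the neighboring columns in the previous row.
            candidates = [p for p in (c - 1, c, c + 1) if 0 <= p < cols]
            best = max(candidates, key=lambda p: score[r - 1][p])
            score[r][c] = score[r - 1][best] + density[r][c]
            back[r][c] = best
    # Backtrack from the best-scoring cell in the last row.
    c = max(range(cols), key=lambda j: score[rows - 1][j])
    path = [c]
    for r in range(rows - 1, 0, -1):
        c = back[r][c]
        path.append(c)
    path.reverse()
    return path


# Toy density map with one bright, slightly wandering "filament".
grid = [
    [0, 9, 0, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 9],
    [0, 0, 9, 0],
]
path = trace_best_path(grid)
```

    The returned list gives the column visited in each row. Restricting each step to adjacent columns is what keeps the path roughly aligned with the assumed mean direction.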

    An Investigation of Atomic Structures Derived from X-ray Crystallography and Cryo-Electron Microscopy Using Distal Blocks of Side-Chains

    Cryo-electron microscopy (cryo-EM) is a structure determination method for large molecular complexes. As more and more atomic structures are determined using this technique, it is becoming possible to perform statistical characterization of side-chain conformations. Two data sets were used to characterize block lengths for each of the 18 types of amino acids. One set contains 9131 structures resolved using X-ray crystallography from density maps with resolutions of 1.5 Å or better, and the other contains 237 protein structures derived from cryo-EM density maps with 2–4 Å resolutions. The results show that the normalized probability density function of block lengths is similar between the X-ray data set and the cryo-EM data set for most of the residue types, but differences were observed for ARG, GLU, ILE, LYS, PHE, TRP, and TYR, for which conformations with certain shorter block lengths are more likely to be observed in the cryo-EM set with 2–4 Å resolutions.

    Cylindrical Similarity Measurement for Helices in Medium-Resolution Cryo-Electron Microscopy Density Maps

    Cryo-electron microscopy (cryo-EM) density maps at medium resolution (5–10 Å) reveal secondary structural features such as α-helices and β-sheets, but they lack the side-chain details that would enable a direct structure determination. Among the more than 800 entries in the Electron Microscopy Data Bank (EMDB) of medium-resolution density maps that are associated with atomic models, a wide variety of similarities can be observed between maps and models. To validate such atomic models and to classify structural features, a local similarity criterion, the F1 score, is proposed and evaluated in this study. The F1 score is theoretically normalized to a range from zero to one, providing a local measure of cylindrical agreement between the density and atomic model of a helix. A systematic scan of 30,994 helices (among 3,247 protein chains modeled into medium-resolution density maps) reveals an actual range of observed F1 scores from 0.171 to 0.848, suggesting that the local similarity is quantified and discriminated as intended. The best (highest) F1 scores tend to be associated with regions that exhibit high and spatially homogeneous local resolution (between 5 Å and 7.5 Å) in the helical density. The proposed F1 scores can be used as a discriminative classifier for validation studies and as a ranking criterion for cryo-EM density features in databases.

    A Tool for Segmentation of Secondary Structures in 3D Cryo-EM Density Map Components Using Deep Convolutional Neural Networks

    Although cryo-electron microscopy (cryo-EM) has been successfully used to derive atomic structures for many proteins, it is still challenging to derive atomic structures when the resolution of cryo-EM density maps is in the medium resolution range, such as 5–10 Å. Detection of protein secondary structures, such as helices and β-sheets, from cryo-EM density maps provides constraints for deriving atomic structures from such maps. As more deep learning methodologies are being developed for solving various molecular problems, effective tools are needed for users to access them. We have developed an effective software bundle, DeepSSETracer, for the detection of protein secondary structure from cryo-EM component maps in medium resolution. The bundle contains the network architecture and a U-Net model trained with a curriculum and gradient of episodic memory (GEM). The bundle integrates the deep neural network with the visualization capacity provided in ChimeraX. Using a Linux server that is remotely accessed by Windows users, it takes about 6 s on one CPU and one GPU for the trained deep neural network to detect secondary structures in a cryo-EM component map containing 446 amino acids. A test using 28 chain components of cryo-EM maps shows overall residue-level F1 scores of 0.72 and 0.65 to detect helices and β-sheets, respectively. Although deep learning applications are built on software frameworks, such as PyTorch and Tensorflow, our pioneering work here shows that integration of deep learning applications with ChimeraX is a promising and effective approach. Our experiments show that the F1 score measured at the residue level is an effective evaluation of secondary structure detection for individual classes. The test using 28 cryo-EM component maps shows that DeepSSETracer detects β-sheets more accurately than Emap2sec+, with weighted average residue-level F1 scores of 0.65 and 0.42, respectively. It also shows that Emap2sec+ detects helices more accurately than DeepSSETracer, with weighted average residue-level F1 scores of 0.77 and 0.72, respectively.

    Tracing Actin Filament Bundles in Three-Dimensional Electron Tomography Density Maps of Hair Cell Stereocilia

    Cryo-electron tomography (cryo-ET) is a powerful method of visualizing the three-dimensional organization of supramolecular complexes, such as the cytoskeleton, in their native cell and tissue contexts. Due to its minimal electron dose and reconstruction artifacts arising from the missing wedge during data collection, cryo-ET typically results in noisy density maps that display anisotropic XY versus Z resolution. Molecular crowding further exacerbates the challenge of automatically detecting supramolecular complexes, such as the actin bundle in hair cell stereocilia. Stereocilia are pivotal to the mechanoelectrical transduction process in inner ear sensory epithelial hair cells. Given the complexity and dense arrangement of actin bundles, traditional approaches to filament detection and tracing have failed in these cases. In this study, we introduce BundleTrac, an effective method to trace hundreds of filaments in a bundle. A comparison between BundleTrac and manual tracing of the actin filaments in a stereocilium showed that BundleTrac accurately built 326 of 330 filaments (98.8%), with an overall cross-distance of 1.3 voxels for the 330 filaments. BundleTrac is an effective semi-automatic modeling approach in which a seed point is provided for each filament and the rest of the filament is computationally identified. We also demonstrate the potential of a denoising method that uses a polynomial regression to address the resolution and high-noise anisotropic environment of the density map.